Abstract

Statements of Shannon's Noiseless Coding Theorem by various
authors, including the original, are reviewed and clarified. Traditional
statements of the theorem are often unclear as to when it applies. A new notation
is introduced and the domain of application is clarified.

An examination of the bounds of the Theorem leads to a new symmetric
restatement. It is shown that the extended upper bound is an achievable
upper bound, giving symmetry to the theorem.

The relation of information entropy to the physical entropy of Gibbs
and Boltzmann is illustrated. Consequently, the study of Shannon entropy
is strongly related to physics, and there is a physical theory of information.
This paper is the beginning of an attempt to clarify these relationships.
An important result of Shannon [1948] shows that it is possible to approach
100% coding efficiency when no noise is present [1]. This result has various
names: Shannon's noiseless coding theorem, Shannon's first theorem, and the
data compression theorem. Oddly enough, it is often not even stated as a
theorem but only as a formula. This paper refers to it as the Theorem.

Several definitions are needed to make our notation clear. A set of events
S is indexed by i, and the event i is represented by the symbol $s_i$, which has
probability $p_i$. There are two sets of symbols: the input alphabet S, of size
q, and the code alphabet A, of size r. The length of the code word for event i is
$l_i$. The expected code length of a code C on input symbols S is $L_C(S)$.
$$H_r(S) = \sum_{i=1}^{q} p_i \log_r \frac{1}{p_i}$$

We call the entropy to the base r the r-entropy. A code of radix r is an
r-code. The entropy of a set S in bits is H(S).
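As an illustration of these definitions, the following minimal Python sketch computes the r-entropy and the expected code length for a small example. The function names, the example distribution, and the code word lengths are assumptions chosen for illustration; they do not come from the original text.

```python
import math

def r_entropy(probs, r=2):
    """r-entropy H_r(S) = sum_i p_i * log_r(1/p_i); terms with p_i = 0 contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0) / math.log2(r)

def expected_length(probs, lengths):
    """Expected code length L_C(S) = sum_i p_i * l_i."""
    return sum(p * l for p, l in zip(probs, lengths))

# Hypothetical source with q = 4 events and a binary (r = 2) prefix code
# with code words 0, 10, 110, 111.
probs = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]

H = r_entropy(probs, r=2)
L = expected_length(probs, lengths)
print(H, L)  # both equal 1.75 for this dyadic distribution
```

For this dyadic distribution the code lengths match $\lg(1/p_i)$ exactly, so the expected length equals the entropy.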



1.1 Physical Entropy 

Shannon entropy H and Gibbs entropy S are abstractly related. In a
classical system with a discrete set of microstates S, where $E_i$ is the energy of a
microstate with probability $p_i$, the physical Gibbs entropy is:

$$S(S) = k_B \sum_{S} p_i \ln \frac{1}{p_i} \quad \text{J/K}$$

where $k_B$ is the Boltzmann constant, $k_B = 1.3806504(24) \times 10^{-23}$ J/K.
This reduces to information entropy by an algebraic transformation.

$$H = \frac{S(S)}{k_B \ln 2} = \sum_{S} p_i \lg \frac{1}{p_i} \quad \text{bits}$$

Boltzmann Entropy

$$S = k_B \ln \Omega$$

Here $\Omega$ is the number of microstates consistent with the given macrostate. This
is the equilikely case of maximum entropy.

$$H = \frac{S}{k_B \ln 2} = \lg \Omega = \lg n$$

The statistical entropy reduces to Boltzmann's entropy when all the
accessible microstates of the system are equally likely. This is the configuration
corresponding to the maximum of a system's entropy for a given set of
accessible microstates. Here the lack of information is maximal. As such,
Boltzmann's entropy is the expression of entropy at thermodynamic equilibrium in
the micro-canonical ensemble. Consequently, the study of Shannon entropy
is strongly related to physics [2]. Naturally, the number of information states
is much smaller than the number of states in physical entropy, except when
quantum information is studied, but that is a topic for a future paper.
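The conversion between physical and information entropy can be checked numerically. The sketch below is illustrative only: the function names and the choice of eight equally likely microstates are assumptions, and it uses the exact SI value of the Boltzmann constant rather than the older figure quoted above.

```python
import math

# Assumption: exact SI value of the Boltzmann constant (2019 redefinition), in J/K.
K_B = 1.380649e-23

def gibbs_entropy(probs):
    """Gibbs entropy S = k_B * sum_i p_i * ln(1/p_i), in J/K."""
    return K_B * sum(p * math.log(1.0 / p) for p in probs if p > 0)

def to_bits(entropy_j_per_k):
    """Convert a physical entropy in J/K to bits: H = S / (k_B ln 2)."""
    return entropy_j_per_k / (K_B * math.log(2))

omega = 8                        # hypothetical number of equally likely microstates
uniform = [1.0 / omega] * omega

S_gibbs = gibbs_entropy(uniform)
S_boltzmann = K_B * math.log(omega)
print(S_gibbs, S_boltzmann)      # agree up to rounding
print(to_bits(S_gibbs))          # close to lg 8 = 3 bits
```

In the equilikely case the Gibbs sum collapses to the Boltzmann form, and dividing by $k_B \ln 2$ recovers the entropy in bits.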
2 History

In this section, the notation of each of the original authors is used where
possible, so some care is needed by the reader. To reduce confusion, we
normalize the terminology where necessary in the representations by various
authors. We hope the spirit of their representations shows through even where
they are not quite literally faithful.

Essentially, the Theorem says that we can construct a code with an av- 
erage number of code symbols per input event that is as close to the entropy 
of S as we like. First, we look at how the Theorem is stated by a number of 
authors. A typical statement of the Theorem is given by Hamming. 

The Theorem (from Hamming [2], 1986)

$$H_r(S) \le L < H_r(S) + \frac{1}{n}$$

The code has r symbols, and L is the average code word length per input
symbol (our L).

The Theorem (from Abramson [3], 1963)

$$H_r(S) \le \frac{L_n}{n} < H_r(S) + \frac{1}{n}$$

Here the use of the nth extension coding is indicated.

The Theorem (from Gallager [4], 1968) ... it is possible to assign code
words ...

$$\frac{H(U)}{\log D} \le \bar{n} < \frac{H(U)}{\log D} + \frac{1}{L}$$

Since the essence of the theorem is that L can approach the entropy, the
Theorem can also be expressed as a limit.

The Theorem (from McEliece [5], 1977)

$$\lim_{m \to \infty} \frac{1}{m} n_s(p^m) = H_s(p)$$

Here the connection that the entropy is measured in the radix of the code
is indicated.

The Theorem (from Cover [6], 1991)

$$H(S) \le L_n^* < H(S) + \frac{1}{n}$$

This assumes binary coding. $L_n^*$ is the average code word length per
input symbol of an optimum code. Here a restriction on the form of coding
is indicated, and the Theorem is restricted to optimum codes.

A. Shannon's Original Statement 

If we return to Shannon's original work, we see a more complex view,
and one that makes the Theorem more difficult to derive. Shannon first expressed the
theorem in terms of channel capacity; however, he also gave the essence of
the theorem in terms of entropy in an alternate proof that leads to all the
statements in this paper. To begin with, Shannon used an estimator for the
entropy of the source. He considered all sequences $s_i$ of N symbols in the
source, thus determining H from the statistics of the message sequence.

The Theorem (after Shannon, 1948)

$$\lim_{N \to \infty} G_N = H$$

$$G_N \le B < G_N + \frac{1}{N}$$

where $G_N$ approaches the entropy as N increases, and B is the average
number of binary symbols per symbol of the source.

Except for Gallager, it is usually not clear from the Theorem statement
that this is only true for some codes. One must read the proofs to see that
Shannon codes always satisfy the relation. While it is true for Shannon codes
or better, it is also true of other codes, a fact generally not apparent from
the discussion. The Theorem is important enough to have a nice, complete,
compact statement.

3 Analysis 

We now examine various ways of restating the Theorem. When no code radix
r is specified, entropy will be measured in bits, and all logs are to the base
2, written as lg. It is important to note that the entropy of the Theorem is
measured in the radix of the code. Of course, $H_r = H / \lg r$.

A. The Code Bounding Lemma 

The lower bound on the average code length of a code is the r-entropy 
of the source set. This interprets entropy as the best possible expected code 
length for an r-code. 

$$H_r(S) \le L$$

Discrete codes cannot always achieve this bound. For a uniform
distribution, the average length of codes with $q = r^k$, where k is an integer greater
than zero, is equal to the entropy.

The surprising fact is that we can always find a code better than the
entropy plus one. The usual upper bound proof for this assumes Shannon
coding, $\mathcal{S}$, or better. For simplicity, this is often expressed as the bound for
Shannon coding.

$$L_{\mathcal{S}} < H_r(S) + 1$$

When the radix of the code is binary, it is usual to drop the subscript r. 
Assuming binary simplifies notation in proofs, but obscures the fact that the 
entropy of these relations must be measured in the radix of the code used. 
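The two bounds can be checked on a small example. The sketch below assumes the usual Shannon code lengths $l_i = \lceil \log_r(1/p_i) \rceil$ with all $p_i > 0$; the function names and the example distribution are illustrative assumptions.

```python
import math

def r_entropy(probs, r=2):
    """r-entropy H_r(S) = sum_i p_i * log_r(1/p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0) / math.log2(r)

def shannon_lengths(probs, r=2):
    """Shannon code lengths l_i = ceil(log_r(1/p_i)), for p_i > 0.

    Each length is found as the smallest integer l with r**l * p_i >= 1,
    which avoids rounding problems in a floating-point logarithm.
    """
    lengths = []
    for p in probs:
        if p <= 0:
            raise ValueError("plain Shannon coding needs p_i > 0")
        l = 0
        while (r ** l) * p < 1:
            l += 1
        lengths.append(l)
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]          # hypothetical source
lengths = shannon_lengths(probs)       # [2, 2, 3, 4]
H = r_entropy(probs)
L = sum(p * l for p, l in zip(probs, lengths))
kraft = sum(2.0 ** -l for l in lengths)

print(lengths, H, L)
print(H <= L < H + 1)                  # True: lower and upper bound both hold
print(kraft <= 1)                      # True: a prefix code with these lengths exists
```

The Kraft sum being at most one confirms that a prefix code with these lengths can actually be built.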

B. Correcting the Upper Bound 
Consider the case H = 0:

$$0 \le L_{\mathcal{S}} < 0 + 1$$

An entropy of zero is given by any distribution where exactly one event
is certain. What is the value of L when H = 0? Both Shannon and Huffman
codes will assign a code word of length 0 to the certain event. The average length
of the resulting code will be 0 in this special case. Thus we have: $0 \le 0 < 0 + 1$.

However, the Theorem is not restricted to these codes. Define an extended
Shannon code as one that assigns a code word of length 1 to the certain event.
The fact that this is less efficient is not important. Thus we have: $0 \le 1 < 0 + 1$.

This is impossible, so the upper bound must be achievable in this special
case.
Allowing a probability of zero results in infinite length code words. A
simple further modification of the definition of a Shannon code gives finite
codes. If $p_i = 0$, then $l_i = 0$. Our extended Shannon code is denoted by s.
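A minimal sketch of the extended Shannon code lengths follows, using the two rules stated in the text (length 1 for a certain event, length 0 for a zero-probability event) and the ordinary Shannon length otherwise. The function name and the exact floating-point comparisons against 0 and 1 are assumptions made for illustration.

```python
import math

def extended_shannon_lengths(probs, r=2):
    """Code lengths of the extended Shannon code described in the text."""
    lengths = []
    for p in probs:
        if p == 0:
            lengths.append(0)      # impossible event: no code word is needed
        elif p == 1:
            lengths.append(1)      # certain event: length 1 instead of 0
        else:
            l = 0
            while (r ** l) * p < 1:  # smallest l with r**l * p >= 1
                l += 1
            lengths.append(l)
    return lengths

# Zero-entropy source: one certain event, two impossible ones.
probs = [1.0, 0.0, 0.0]
lengths = extended_shannon_lengths(probs)
H = sum(p * math.log2(1.0 / p) for p in probs if p > 0)
L = sum(p * l for p, l in zip(probs, lengths))

print(lengths)       # [1, 0, 0]
print(H, L)          # 0.0 1.0
print(L == H + 1)    # True: the upper bound is attained
```

For any distribution without certain or impossible events the function returns the ordinary Shannon lengths, so only the boundary case changes.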

C. A New Upper Bound 

$$L_s \le H_r(S) + 1$$

There exist bounded codes that achieve the upper bound. Why was this 
result not observed before this? Certainly, a great many people have seen 
this relation. The standard proof, for binary codes, begins with: 

$$\lg \frac{1}{p_i} \le l_i < \lg \frac{1}{p_i} + 1$$

because for a Shannon code $l_i = \lceil \lg(1/p_i) \rceil$.

At this point, to allow $p_i = 0$ is obviously silly, and so the boundary
case of a certain event is never considered. Proofs are a demonstration of
understanding, not a method of discovery.

Our result on the new upper bound for $L_C \le L_s$ is simple but apparently
not obvious [1], [2], [3], [4], [5], [6]. It resulted from asking why the Theorem did
not exhibit symmetry.

Should this special case be included? The argument that can be used in
the certain event boundary case is that when an event is certain, no code
word need be used in any coding method; thus the average length is zero,
and the original statement is correct. However, in general, for n codes we do
not know their exact probabilities, and these may or may not be zero. In
fact, we would have to exclude the assignment of a code to the certain event.
While using a Shannon code for the proof of the Theorem does in fact do
this, codes less efficient than Shannon codes still satisfy the original theorem. We
can conclude that the restatement of the upper bound is correct.

A final objection is that the chance of one event having a probability of 
one is unlikely and trivial. This may be so, but equilikely events are even less 
probable as the number of events increases. 

D. General Code Bounding Lemma 

There is no upper bound in general for $L_C$, but we can define a working
upper bound. Any S can be encoded by a block code B, with $L_B = \lceil \log_r q \rceil$.
The only reason to use a variable length code is to be more efficient than a
block code. So it is reasonable never to use codes with average length worse
than a block code. Define a good code, gC, as any code where $L_{gC} \le L_B$.
This gives a general form of the Lemma.

Lemma: Good Code Bounding

$$H_r \le L_{gC} \le H_r + k$$

where $k = \log_r q$, or $k = 1$ for all $L_C \le L_s$.

Actually, we have $L_{gC} \le \lceil \log_r q \rceil$; thus, the lemma bound is not always
good.
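The block code length and the good-code test are easy to compute. The following sketch is illustrative: the function names are assumptions, and the two sets of code lengths are hypothetical examples (the second set is one valid binary prefix code for four symbols).

```python
def block_code_length(q, r=2):
    """L_B = ceil(log_r q): the smallest integer l with r**l >= q."""
    l = 0
    while r ** l < q:
        l += 1
    return l

def is_good_code(probs, lengths, r=2):
    """A code is 'good' if its average length does not exceed the block code length."""
    avg = sum(p * l for p, l in zip(probs, lengths))
    return avg <= block_code_length(len(probs), r)

print(block_code_length(5, r=2))   # 3
print(block_code_length(9, r=3))   # 2

probs = [0.4, 0.3, 0.2, 0.1]
print(is_good_code(probs, [2, 2, 3, 4]))  # False: average 2.4 exceeds L_B = 2
print(is_good_code(probs, [1, 2, 3, 3]))  # True: average 1.9 is below L_B = 2
```

The first set of lengths is the Shannon code for this distribution, which shows that a Shannon code need not be a good code in the sense defined here.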

E. The nth Extension of a Code

The Theorem is a consequence of recoding a source in blocks of source
symbols in order to better match the code lengths to code probabilities. The
nth extension of a source is formed by concatenating all sequences of n of the
original source symbols to form a new (compound) symbol or event. This
new event now has a probability that is the product of the probabilities of
the original source symbols.

The new alphabet is $S^n$, and the average length of the extended code
in blocks is $L(S^n)$. The average length of the extended code in the input
symbols S is $L(S)$.

Lemma: Extension Average Length

$$L(S^n) = n L(S)$$

When it is necessary to remember that the nth extension was encoded for
L(S), we use the power n, as in $L^n(S)$.

Our new definition, thus, extends nicely and is related to the entropy 
relation for code extensions. 

Lemma: Extension Entropy 

$$H(S^n) = n H(S)$$
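The extension entropy lemma can be verified numerically for a memoryless source by enumerating all blocks of n symbols. This is a sketch under that independence assumption; the function names and the example distribution are illustrative.

```python
import itertools
import math

def entropy_bits(probs):
    """Entropy in bits: H = sum_i p_i * lg(1/p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def extension_probs(probs, n):
    """Block probabilities of the nth extension S^n of a memoryless source."""
    return [math.prod(block) for block in itertools.product(probs, repeat=n)]

probs = [0.7, 0.2, 0.1]   # hypothetical source with q = 3
n = 3
ext = extension_probs(probs, n)

print(len(ext))                                   # 27 compound symbols
print(entropy_bits(ext), n * entropy_bits(probs)) # equal up to rounding
```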

4 Results 

We now summarize the results of our discussion and obtain a new statement 
of the Theorem. 

Theorem 1: Code Bounding For extended Shannon codes s,

$$H_r(S) \le L_s \le H_r(S) + 1$$

Further symmetries can be observed. 
If H = L, then H = L < H + 1. 
If H = 0, then H < L = H + 1. 

The maximum excess of 1 is not always attained. For $q = r^k$ equally
likely events, the excess is zero, but it jumps to nearly $(q - 1)/q$ for almost equally
likely events, where all but one are just under probability $1/q$. Shannon code
excess can be quite sensitive to small changes in probability.
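This sensitivity can be seen with q = 4 and r = 2. The sketch below perturbs a uniform distribution in both directions; the perturbation size and the function names are assumptions chosen for illustration. Pushing three of the four probabilities slightly below 1/q raises their Shannon lengths from 2 to 3, while pushing them slightly above 1/q lengthens only the one remaining code word.

```python
import math

def entropy_bits(probs):
    """Entropy in bits: H = sum_i p_i * lg(1/p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def shannon_excess(probs, r=2):
    """Average Shannon code length minus the entropy (binary case, all p_i > 0)."""
    total = 0.0
    for p in probs:
        l = 0
        while (r ** l) * p < 1:    # smallest l with r**l * p >= 1
            l += 1
        total += p * l
    return total - entropy_bits(probs)

q, eps = 4, 0.001
uniform = [1 / q] * q
under = [1 / q - eps] * (q - 1) + [1 / q + (q - 1) * eps]  # all but one just under 1/q
over = [1 / q + eps] * (q - 1) + [1 / q - (q - 1) * eps]   # all but one just over 1/q

excess_uniform = shannon_excess(uniform)
excess_under = shannon_excess(under)
excess_over = shannon_excess(over)
print(excess_uniform)   # 0.0
print(excess_under)     # close to (q - 1)/q = 0.75
print(excess_over)      # close to 1/q = 0.25
```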

The new result follows from the known technique of coding the nth
extension and the preceding relations. We have renamed it the Code Compression
Theorem of Shannon, to reflect what it says.

Theorem 2: Code Compression For extended Shannon codes or better,

$$H_r(S) \le L_s^n(S) \le H_r(S) + \frac{1}{n}$$

Proof: Encode the nth extension of a source by an extended Shannon
code and apply the code bounding theorem.

$$H_r(S^n) \le L_s^n(S^n) \le H_r(S^n) + 1$$

Substituting for the extension entropy and extension average length gives

$$n H_r(S) \le n L_s^n(S) \le n H_r(S) + 1$$

Dividing by n gives:

$$H_r(S) \le L_s^n(S) \le H_r(S) + \frac{1}{n}$$

Proof Discussion: We might want to check if the upper bound is
attained when the entropy is zero. Only the sequence of n certain events in
$S^n$ has probability one. All other sequences have probability zero. Since
$L_s^n(S^n) = 1$, we have $L^n(S) = 1/n$.

$$0 \le \frac{1}{n} \le 0 + \frac{1}{n}$$

For an extended Shannon or better encoding of the nth extension, the
excess of the average over the entropy approaches zero as n increases. Another
way to look at this is to realize that as we encode larger and larger event sets,
the excess bound of 1 decreases as a percent of $H_r$. This can be seen directly
from the code bounding theorem.
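The shrinking bound on the excess can be observed by Shannon coding successive extensions of a small binary source. This is a sketch only: it assumes a memoryless source with all probabilities strictly between 0 and 1 (so the extended and ordinary Shannon codes coincide), binary coding, and illustrative function names.

```python
import itertools
import math

def entropy_bits(probs):
    """Entropy in bits: H = sum_i p_i * lg(1/p_i)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def shannon_length(p, r=2):
    """Smallest integer l with r**l * p >= 1, i.e. ceil(log_r(1/p)) for p > 0."""
    l = 0
    while (r ** l) * p < 1:
        l += 1
    return l

def per_symbol_length(probs, n, r=2):
    """L^n(S): expected Shannon code length of the nth extension, per source symbol."""
    total = 0.0
    for block in itertools.product(probs, repeat=n):
        p = math.prod(block)          # block probability for a memoryless source
        total += p * shannon_length(p, r)
    return total / n

probs = [0.7, 0.3]                    # hypothetical binary source
H = entropy_bits(probs)
results = [(n, per_symbol_length(probs, n)) for n in range(1, 7)]

for n, Ln in results:
    # Columns: n, per-symbol length, excess over H, bound 1/n on the excess.
    print(n, round(Ln, 4), round(Ln - H, 4), round(1 / n, 4))
```

The excess does not have to decrease monotonically with n, but it always stays below 1/n, which is what forces the limit.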

Theorem: Code Compression Theorem of Shannon There exist
codes C such that

$$H_r(S) \le L_C(S) \le H_r(S) + \epsilon$$

Nothing is free in life. The downside of code compression is that a
sufficiently long message must be sent in order to apply an nth extension
code. This means a delay in receiving messages. In computer terms,
Shannon's Noiseless Coding Theorem is a batch processing system. For the short
messages of an interactive processing system, the Theorem is not applicable.



5 Conclusions 

How we think about concepts is influenced by the notation used. The impor- 
tance of notation is not a new idea [8] but does bear repeating. Indeed, more 
confusion is usually engendered by the notation used than by the actual ideas 
expressed. 

Boundary points often exhibit unusual behaviour, and this paper gives 
an interesting example of how such behaviour is not at all obvious until after 
it has been observed. 

Information theoretic inequalities are important statements of
fundamental patterns [7], [8]. The Code Bounding and Code Compression theorems are
quite elegant statements of the interrelationship of coding efficiency and
entropy. This newfound symmetry only adds to their elegance.